Enable model caching for Whisper pipeline on GPU and NPU #2759
Conversation
    ov::AnyMap ov_config;
    if (device == "NPU" || device.find("GPU") != std::string::npos) {  // need to handle cases like "GPU", "GPU.0" and "GPU.1"
        // Cache compiled models on disk for GPU and NPU to save time on the
        // next run. It's not beneficial for CPU.
Why is it not beneficial for CPU?
- This comment is simply copied from the reference sample code.
- AFAIK the CPU plugin's "compile" step is mostly graph rewrites and primitive selection. It typically takes milliseconds to a few hundred milliseconds, not seconds to minutes as on GPU/NPU.
- Most importantly, enabling model caching on CPU causes the Whisper pipeline to crash. This looks like a bug that needs further investigation, so for now model caching is enabled only on GPU and NPU to avoid the issue.
Pull Request Overview
Adds GPU/NPU model caching configuration to Whisper speech recognition sample code.
- Introduces helper to build caching config in both Python and C++ samples.
- Applies conditional logic to enable caching only on GPU/NPU devices.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| samples/python/whisper_speech_recognition/whisper_speech_recognition.py | Adds cache config helper and conditional passing of CACHE_DIR to WhisperPipeline. |
| samples/cpp/whisper_speech_recognition/whisper_speech_recognition.cpp | Adds cache config helper and conditional AnyMap passed to WhisperPipeline. |
    ov_config = dict()
    if args.device == "NPU" or "GPU" in args.device:  # need to handle cases like "GPU", "GPU.0" and "GPU.1"
        # Cache compiled models on disk for GPU and NPU to save time on the
        # next run. It's not beneficial for CPU.
        ov_config = get_config_for_cache()
Copilot AI (Oct 16, 2025)
The condition handles GPU variants (e.g. GPU.0) but will skip NPU variants such as 'NPU.0', limiting caching despite the PR goal to enable it for NPU. Update the condition to also match NPU suffixed forms, e.g.: if 'GPU' in args.device or args.device.startswith('NPU'):. Alternatively use substring checks for both: if 'GPU' in args.device or 'NPU' in args.device:.
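To make the suggestion concrete, here is a minimal sketch of how the Python sample could apply it, assuming a hypothetical `get_config_for_cache()` helper that returns a `CACHE_DIR` entry; the cache directory name and model path below are placeholders, not taken from this PR:

```python
import openvino_genai


def get_config_for_cache() -> dict:
    # Hypothetical helper: cache compiled models on disk so the next run can
    # skip GPU/NPU compilation. The directory name is illustrative.
    return {"CACHE_DIR": "whisper_cache"}


models_path = "whisper-base"  # placeholder path to an exported Whisper model
device = "NPU.0"              # example device string with a suffix

ov_config = dict()
# Substring checks cover "GPU", "GPU.0", "NPU", "NPU.0", and similar variants.
if "GPU" in device or "NPU" in device:
    ov_config = get_config_for_cache()

pipeline = openvino_genai.WhisperPipeline(models_path, device, **ov_config)
```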
    ov::AnyMap ov_config;
    if (device == "NPU" || device.find("GPU") != std::string::npos) {  // need to handle cases like "GPU", "GPU.0" and "GPU.1"
        // Cache compiled models on disk for GPU and NPU to save time on the
        // next run. It's not beneficial for CPU.
        ov_config = get_config_for_cache();
    }

    ov::genai::WhisperPipeline pipeline(models_path, device, ov_config);
Copilot AI (Oct 16, 2025)
The condition enables caching for GPU variants but misses NPU variants like 'NPU.0', restricting caching contrary to the stated intent. Adjust to also detect NPU substrings: if (device.find("GPU") != std::string::npos || device.find("NPU") != std::string::npos) { ... }.
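For the C++ sample, a corresponding sketch might look like the following. The `get_config_for_cache()` helper name and the cache directory are assumptions for illustration; the cache itself is enabled through the standard `ov::cache_dir` property:

```cpp
#include <string>

#include "openvino/genai/whisper_pipeline.hpp"
#include "openvino/runtime/properties.hpp"

// Hypothetical helper: cache compiled models on disk so the next run can
// skip GPU/NPU compilation. The directory name is illustrative.
static ov::AnyMap get_config_for_cache() {
    ov::AnyMap config;
    config.insert(ov::cache_dir("whisper_cache"));
    return config;
}

int main(int argc, char* argv[]) {
    const std::string models_path = argc > 1 ? argv[1] : "whisper-base";
    const std::string device = argc > 2 ? argv[2] : "GPU";  // e.g. "GPU", "GPU.0", "NPU", "NPU.0"

    ov::AnyMap ov_config;
    // Substring checks cover suffixed device names such as "GPU.1" or "NPU.0".
    if (device.find("GPU") != std::string::npos || device.find("NPU") != std::string::npos) {
        ov_config = get_config_for_cache();
    }

    ov::genai::WhisperPipeline pipeline(models_path, device, ov_config);
    return 0;
}
```

On the first run the pipeline still compiles the models as usual; subsequent runs pointed at the same cache directory can load the cached blobs instead of recompiling.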
build_jenkins
Whisper sample code to enable model caching on GPU and NPU
This is a follow-up to #2751.
Sample Code Reference:
https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/python/visual_language_chat/encrypted_model_vlm.py#L87
https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/cpp/text_generation/encrypted_model_causal_lm.cpp#L52
OPTIMIZE_SIZE and encryption are not included. The main performance concern for Whisper is pipeline speed. Since Whisper is much smaller than LLMs, size optimization offers only marginal savings while potentially adding latency. Similarly, model encryption can introduce additional latency.